Generative models such as Midjourney, DALL-E 2, and Stable Diffusion have made it possible for anyone to easily create convincing images with artificial intelligence. But can diffusion models also be applied beyond images, for example to video or 3D modeling?
In this column, following the previous installment, we analyze patents that apply the actively researched diffusion model to domains beyond images, and we examine the research trends and prospects for diffusion model-related technologies.
Patents Utilizing Diffusion Models in the Audio Domain
Recently, there have been several attempts to apply diffusion models not only to the image domain but also to other data domains, such as video, 3D, and audio. Below, we will examine how diffusion models are being utilized in domains beyond images through a selection of sample patents chosen by PI IP LAW.
First, we introduce a patent filed and registered by Seoul National University, KR 2023-0032673, titled "Speech Synthesis System with Controllable Generation Speed," which employs diffusion models in the audio domain.
This patent pertains to a speech synthesis model that takes text as input and generates the corresponding speech. The model comprises three main components: a step encoder, a text encoder, and a decoder. The decoder receives the n-th Gaussian noise and produces the (n-1)-th Gaussian noise. During this process it uses both the 'step embedding,' which carries information about the diffusion time step, and the 'text embedding,' which corresponds to the condition describing the speech to be generated. The step embedding and text embedding are produced by separate modules, the step encoder and the text encoder, respectively. This configuration follows the typical approach diffusion models take for conditional generation.
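For intuition, below is a minimal PyTorch-style sketch of how such a conditional denoising step could be wired together. The module names, dimensions, and internals are illustrative assumptions made for this column, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class StepEncoder(nn.Module):
    """Encodes the diffusion time step into a step embedding (illustrative only)."""
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, t):                    # t: (batch, 1), normalized step index
        return self.mlp(t)                   # (batch, dim)

class TextEncoder(nn.Module):
    """Encodes a text/phoneme ID sequence into a pooled text embedding."""
    def __init__(self, vocab_size=256, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        out, _ = self.rnn(self.embed(tokens))
        return out.mean(dim=1)               # (batch, dim)

class Decoder(nn.Module):
    """Maps the n-th noisy signal toward the (n-1)-th, conditioned on the
    step embedding and the text embedding."""
    def __init__(self, audio_dim=80, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + 2 * cond_dim, 256), nn.SiLU(),
            nn.Linear(256, audio_dim))

    def forward(self, x_n, step_emb, text_emb):
        h = torch.cat([x_n, step_emb, text_emb], dim=-1)
        return self.net(h)                   # estimate used to form the (n-1)-th sample

# Example: one reverse step on a batch of two mel-spectrogram "frames"
dec, step_enc, txt_enc = Decoder(), StepEncoder(), TextEncoder()
x_n = torch.randn(2, 80)
t = torch.full((2, 1), 0.5)                  # normalized diffusion step
tokens = torch.randint(0, 256, (2, 12))      # phoneme IDs
x_prev = dec(x_n, step_enc(t), txt_enc(tokens))
```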
Furthermore, in this invention the noise added at each time step is learned from the training data, but once training is complete, an 'accelerated sampling process' adopts the idea of DDIM (Denoising Diffusion Implicit Model) to increase data generation speed by skipping a certain number of time steps. So what exactly is DDIM?
With the introduction of DDPM (Denoising Diffusion Probabilistic Model), it became possible to generate high-quality data with diffusion models. However, one of DDPM's central assumptions, the Markov chain property that x_t depends only on x_(t-1), requires generating intermediate data at every time step during sampling, which makes generation slow. DDIM (paper link) was designed to address this issue. Instead of strictly adhering to the Markov property, its forward process is defined so that x_t depends not only on x_(t-1) but also on x_0.
When generating data, DDIM predicts x_0 from x_t and can then generate x_(t-1) from that predicted x_0. Because predicting x_(t-1) from x_0 and predicting x_(t-2) from x_0 are largely consistent with each other, skipping intermediate steps and jumping directly to x_(t-2) does not significantly degrade the quality of the generated data. With T=1000, DDPM must generate 1000 intermediate samples to produce the final output, whereas DDIM needs only 500 when every other step is used and 250 when only every fourth step is used.
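As a rough illustration, the NumPy sketch below shows how a DDIM-style sampler draws a subsequence of time steps and performs one deterministic update (eta = 0). It is a simplified, generic example whose step counts match the T=1000 figures above, not code from the patent or the DDIM paper.

```python
import numpy as np

def ddim_timesteps(T=1000, skip=1):
    """Use every (skip+1)-th step, from T-1 down toward 0.
    skip=0 reproduces the full schedule; skip=1 halves the number of steps;
    skip=3 quarters it."""
    return np.arange(T - 1, -1, -(skip + 1))

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update: predict x_0 from x_t, then form the
    sample at the previous (possibly skipped-to) time step."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

print(len(ddim_timesteps(skip=1)))   # 500 sampling steps (every other step)
print(len(ddim_timesteps(skip=3)))   # 250 sampling steps (every fourth step)
```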
Returning to the patent: the model is configured so that the number of time steps skipped during sampling changes with the value of a parameter (gamma). With this design, a gamma that skips many time steps widens the interval between time steps and lowers the quality of the synthesized speech, while a gamma that skips few time steps narrows the interval and raises the quality. Users can therefore trade off sampling speed against speech quality according to their needs. In addition, the model achieves sufficient speech quality even with a smaller model size than traditional speech synthesis models.
The independent claim of this patent, Claim 1, is structured as follows.
An audio synthesis system (100) capable of controlling the generation speed, comprising: a Text Encoder (110) that takes text or phoneme sequences as input and outputs text embeddings; a Step Encoder (120) that takes the diffusion time step as input and outputs step embeddings to inform the model of which time step it is modeling; and a Decoder (130) that takes the n-th Gaussian noise as input and, as conditions, receives the text embedding from the Text Encoder (110) and the step embedding from the Step Encoder (120), thereby producing Gaussian noise corresponding to the (n-1)-th time step, whereby the audio synthesis system enables control of the generation speed.
The language of the claim is concise, and it recites only the essential components of a diffusion model that can control data generation speed, such as DDIM. This suggests a strong patent with a broadly drafted scope of rights. Anyone planning to use text-to-speech synthesis technology in their service should therefore design carefully to avoid infringing this patent.
Patent Utilizing Diffusion Models in the 3D Domain
Next, let's take a look at the patent registered under CN 116310153, titled "Single-view color 3D point cloud reconstruction methods, systems, storage media, and computers," filed and registered by NANCHANG HANGKONG University in China. This patent applies diffusion models in the 3D domain and addresses a method for generating color point cloud data (a collection of points representing the surface of an object) from a single-view (2D) image of the object, even when no three-dimensional information about the object is available.
The method for creating a color point cloud in this patent consists of the following steps:
1) Utilizing a diffusion model to generate a point cloud of the object from a single-view image of the object.
2) Utilizing the color information (color implicit code) of the single-view image to generate color information for the point cloud.
3) Utilizing both the point cloud and its color information to render the final point cloud image.
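To make the flow of these three steps concrete, here is a heavily simplified Python skeleton. Every function below is a hypothetical stub with invented names and placeholder outputs; it illustrates the pipeline shape only and is not the patent's actual implementation.

```python
import numpy as np

def encode_image(image):
    """Stand-in for an encoder mapping a single-view image to implicit codes."""
    shape_code = np.zeros(256)   # placeholder shape implicit encoding
    color_code = np.zeros(256)   # placeholder color implicit encoding
    return shape_code, color_code

def diffusion_point_cloud(shape_code, num_points=2048):
    """Stand-in for a diffusion model that denoises random points into a
    point cloud conditioned on the shape code."""
    return np.random.randn(num_points, 3)          # (x, y, z) per point

def estimate_colors(points, color_code):
    """Stand-in for color estimation conditioned on the color implicit code."""
    return np.clip(np.random.rand(points.shape[0], 3), 0.0, 1.0)  # RGB per point

def render(points, colors, camera_params=None):
    """Stand-in for rendering the colored point cloud to a 2D image."""
    return np.zeros((256, 256, 3))

image = np.zeros((256, 256, 3))                    # the single-view input
shape_code, color_code = encode_image(image)       # step 1 input codes
points = diffusion_point_cloud(shape_code)         # step 1
colors = estimate_colors(points, color_code)       # step 2
rendered = render(points, colors)                  # step 3
```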
Several recent papers have achieved remarkable success in inferring the 3D structure of objects from single-view images using diffusion models. This patent was likewise filed with diffusion-model-based 3D inference in mind, suggesting the potential for extending the capabilities of diffusion models.
The independent claim of this patent, Claim 1 (English translation), is as follows.
A single-view color three-dimensional point cloud reconstruction method, comprising: obtaining an arbitrary image of interest and encoding it with an image encoder to obtain a shape implicit encoding and a color implicit encoding; performing point cloud reconstruction based on a diffusion model and the shape implicit encoding to obtain a target point cloud with the target shape, and estimating colors for the reconstructed point cloud from the color implicit encoding to obtain a point cloud color for each point in the target point cloud; obtaining sampling point positions from the camera parameters corresponding to the target point cloud, and computing the volume density and radiance at each sampling position from the target point cloud and its per-point colors to render the corresponding predicted point cloud image; and, taking the real object image as a condition, optimizing the point cloud color and point cloud shape of the predicted point cloud image and fine-tuning the optimized result to realize three-dimensional point cloud reconstruction of the real object image.
Looking at the content of the claim, it predominantly recites the essential components for estimating color from a 2D image of an object and generating point cloud data. While it includes a step for "optimizing" the reconstructed point cloud, this can be interpreted broadly, so the actual scope of rights granted by this patent does not appear narrow. As with the patent discussed earlier, careful design is therefore necessary to avoid infringement when using technology that generates point cloud data from images.
However, utilizing diffusion models in domains other than images is not a straightforward task. As referenced in the first column, the number of patents filed in domains other than images is significantly lower compared to the image domain. What properties of data make it challenging to apply diffusion models in domains beyond images? How can such challenges be addressed?
Utilizing Diffusion Models in the Video Domain
In this context, the last patent to be introduced is a published Chinese patent application (publication number CN 115761593) titled "Action video generation method based on a diffusion model," filed by NANJING ZHILUN DIGITAL TECH. This patent covers video generation using diffusion models.
Typically, a diffusion model for image generation is trained by gradually adding noise to an image and then learning the correlations between pixels while gradually removing that noise. With these learned correlations, the model can take pure noise as input and generate a clean image.
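As a reminder of this basic mechanism, the sketch below shows the standard closed-form forward noising step that such training relies on. It is a generic DDPM-style toy example in PyTorch, not code from this patent.

```python
import torch

def forward_noise(x0, t, alpha_bar):
    """Corrupt a clean sample x0 to time step t using the closed form
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1)
    x_t = torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps
    return x_t, eps

# Toy schedule: 1000 steps with linearly increasing noise.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.randn(4, 64)                 # pretend "images" of 64 pixels each
t = torch.randint(0, T, (4,))           # random time step per sample
x_t, eps = forward_noise(x0, t, alpha_bar)
# A denoising network would be trained to predict eps from (x_t, t);
# sampling then removes noise step by step using those predictions.
```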
However, videos differ from images in that they consist of many images arranged in chronological order. To generate a natural-looking video, a model must consider not only the relationships between pixels within a single frame but also the features of the frames before and after it. Video generation is therefore a considerably harder task than image generation. For example, a model that ignores spatiotemporal features might produce a video in which a thrown object grows larger as it moves away from the throwing hand, when it should appear smaller, creating a sense of incongruity.
In other words, when a diffusion model designed for image generation is used to generate video, it struggles to predict how an action should unfold in the next time step.
To address this, the invention in question uses a 3D convolutional neural network on videos containing the target action to learn the achievable height and width ranges of the desired action as well as its temporal and spatial features, thereby mitigating the problems that arise when generating videos with diffusion models.
If a 2D convolutional neural network is used to extract features from a video, its output is 2D, so the temporal information of the video is not preserved. A 3D convolutional neural network, on the other hand, produces a 3D volume as output, preserving not only height and width information but also the video's temporal information.
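The difference is easy to see in code. The short PyTorch check below is an illustration unrelated to the patent's actual network: a 2D convolution has to fold the time axis away, while a 3D convolution keeps it.

```python
import torch
import torch.nn as nn

frames = torch.randn(1, 3, 16, 64, 64)   # (batch, channels, time, height, width)

# 2D convolution sees one frame at a time: the time axis must be folded away
# (here merged into the batch), so temporal relations between frames are lost.
conv2d = nn.Conv2d(3, 8, kernel_size=3, padding=1)
per_frame = conv2d(frames.permute(0, 2, 1, 3, 4).reshape(-1, 3, 64, 64))
print(per_frame.shape)                    # torch.Size([16, 8, 64, 64]) -- no time axis

# 3D convolution slides over time as well, so its output keeps a time axis
# and can encode how features change from frame to frame.
conv3d = nn.Conv3d(3, 8, kernel_size=3, padding=1)
volume = conv3d(frames)
print(volume.shape)                       # torch.Size([1, 8, 16, 64, 64]) -- time preserved
```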
For example, consider a video of a thrown ball in flight. With a 3D convolutional neural network, as time progresses across frames, the model can capture temporal and spatial features such as the ball gradually getting smaller and the arm's rotation never exceeding the length of the arm. With a 2D convolutional neural network, by contrast, only spatial features such as the positions of the hand and the ball can be obtained, because the temporal order of events is not considered.
In other words, by using a 3D convolutional neural network to capture both the spatial features of the target for each desired action and the spatiotemporal features of the video, this invention can generate natural videos that respect the achievable height and width ranges and the temporal and spatial characteristics of the target's action.
The independent claim of this patent, Claim 1 (English translation), is as follows.
An action video generation method based on a diffusion model, comprising the steps of: S1. Collect video of the target action and preprocess it to obtain a video frame sequence; S2. Identify the corresponding target within the video frame sequence; S3. Use a three-dimensional convolutional neural network to extract regional features of the target and a spatiotemporal feature map of the video; S4. Reconstruct the temporal series and spatial connection relationships of the target; S5. Identify the video frames of the target at different timings through the intelligent learning machine, and classify and name the target action; S6. According to the preset action video generation time, produce dynamic videos of different named time periods for the same target; S7. According to the classification and naming of the input target and action, output the dynamic video of the time periods before and after the same target naming.
Claim 1 can be grouped into four main steps as follows:
(1) Aligning the video frame sequence with the target and generating spatiotemporal feature maps of the target and video through a 3D convolutional neural network.
(2) Reconstructing the temporal order and spatial connections of the target, identifying video frames at different timings of the target through an artificial intelligence learning model, and classifying the target's actions.
(3) Generating videos of different timeframes for the same target according to preset action video generation times.
(4) Providing action videos of the same target for previous and subsequent timeframes based on the input target and action classification and names.
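Purely to show how these four steps could fit together as a pipeline, here is a heavily simplified Python skeleton. Every function below is a hypothetical stub with invented names; it illustrates the control flow only and contains none of the patent's actual models.

```python
import numpy as np

def extract_spatiotemporal_features(frames):        # step (1)
    """Stand-in for aligning frames with the target and running a 3D CNN."""
    return np.stack([f.mean(axis=-1) for f in frames])

def classify_actions(features):                     # step (2)
    """Stand-in for reconstructing temporal/spatial relations and labeling
    each frame's action with a learned classifier."""
    return ["throw"] * len(features)

def generate_segments(labels, duration):            # step (3)
    """Stand-in for generating video segments of a preset duration."""
    return {label: np.zeros((duration, 64, 64, 3)) for label in set(labels)}

def select_segment(segments, target, action):       # step (4)
    """Stand-in for returning the segment matching the requested target/action."""
    return segments.get(action)

frames = [np.zeros((64, 64, 3)) for _ in range(2)]  # e.g. only two input frames
features = extract_spatiotemporal_features(frames)
labels = classify_actions(features)
segments = generate_segments(labels, duration=32)
video = select_segment(segments, target="ball", action="throw")
print(video.shape)                                  # (32, 64, 64, 3): a longer clip
```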
Unlike images, videos contain additional spatiotemporal features. Failing to consider them when generating video can result in issues such as a thrown ball appearing to grow larger as it moves away from the throwing hand, when it should appear smaller.
Through configuration (1), which aligns the video frame sequence with the target and generates spatiotemporal feature maps of the target and the video through a 3D convolutional neural network, this patent obtains the achievable height and width ranges as well as the temporal and spatial characteristics of the actions performed by the target.
For example, in a video of a thrown ball in flight, a 3D convolutional neural network can capture, as the frames progress, temporal and spatial features such as the ball gradually becoming smaller and the arm's rotation never exceeding the length of the arm.
Furthermore, through configuration (2), which reconstructs the temporal sequence and spatial relationships of the target, identifies the target's video frames at different timings using an artificial intelligence learning model, and classifies the target's actions, the reconstructed target avoids unnatural temporal or spatial inconsistencies across video frames in subsequent generation. For instance, for the action of throwing a ball, the ball can be depicted as getting smaller as it moves away from the hand, and the temporal sequence and spatial relationships of the ball traveling in the direction of the throw can be reconstructed.
In addition, through configuration (3), which generates videos of different timeframes for the same target according to preset action video generation times, and configuration (4), which provides action videos of the same target for the preceding and following timeframes based on the input target and action classification and names, one can obtain videos of the desired target performing the desired named action.
Specifically, through this patent, even when only two frames of a "video of throwing a ball" are available, a longer version of that video can be obtained.
Looking at the components of this patent, configuration (1), which generates spatiotemporal feature maps of the target and video using a 3D convolutional neural network, is a step that will generally be needed whenever spatiotemporal features are acquired from video.
Additionally, the configurations in (3) that generate videos for different timeframes of the same subject based on pre-defined action video generation times and (4) that provide action videos from both before and after a specific time for the same subject based on the input subject and action classification and naming can be applied directly to methods that generate videos through text prompts, such as Imagen Video.
However, the configuration in (2), which reconstructs the temporal sequence and spatial connections of the subject and identifies video frames at different timings of the subject using an artificial intelligence learning model, can potentially be avoided by not performing the step of identifying video frames at different timings using an artificial intelligence learning model.
So far, we've examined patents that utilize diffusion models in the domains of audio, 3D, and video. To harness diffusion models in data domains composed of even more complex information than images, there is a need to address the challenges arising from the characteristics of these data domains. In recently filed patents, unique solutions to overcome these challenges have been introduced, expanding the potential utility of diffusion models.
The next column will be the last in this series on diffusion models; based on the current landscape, we will discuss the outlook for what kinds of diffusion model-related patents are likely to be filed in the future. We ask for your continued interest.